Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types by kosiew · Pull Request #1347 · apache/datafusion-python

kosiew · 2026-01-20T05:18:41Z

Which issue does this PR close?

Closes Cannot do udaf that returns list of timestamps #1339.

Rationale for this change

Python UDAFs that logically return a list (e.g., “collect all timestamps for a group”) were easy to implement incorrectly by returning a pyarrow.Array (or ChunkedArray) directly from Accumulator.evaluate(). In that case, DataFusion’s Python ↔ Rust conversion path could attempt to treat the array object as an integer-like scalar, leading to confusing conversion failures such as:

ArrowTypeError: object of type <class 'pyarrow.lib.TimestampArray'> cannot be converted to int

To make list-returning UDAFs work reliably (and make the contract explicit), we now ensure the UDAF result is a list-valued scalar that matches the declared Arrow return type.

What changes are included in this PR?

Documentation & API contract clarification
- Added a FAQ entry explaining how to return a list from a UDAF (return a list-valued pa.scalar and declare list types for return_type and state_type).
- Expanded Accumulator.evaluate() docstring to explicitly state that returning a pyarrow.Array is not supported unless converted to a list-valued scalar.
Runtime behavior improvements in the Rust accumulator bridge
- In RustAccumulator.evaluate, detect when the declared return type is a list (List, LargeList, FixedSizeList) and the Python evaluate() result is pyarrow.Array/pyarrow.ChunkedArray.
- Convert such array-like values into a list-valued pyarrow.scalar(..., type=<declared list type>) before extracting the scalar value for DataFusion.
- Cache the pyarrow.Array and pyarrow.ChunkedArray Python types inside the accumulator to avoid repeated pyarrow imports/type lookups on every evaluate call.
Tests
- Added a regression test (test_udaf_list_timestamp_return) covering a UDAF that returns a list of timestamp(ns) values.
- Added a small CollectTimestamps accumulator used by the test that maintains list state as a list-valued scalar and returns a list-valued scalar from evaluate().

Are these changes tested?

Yes.

Added a new regression test verifying that a UDAF can return list<timestamp(ns)> and that the collected results match the expected list-of-timestamps output.

Are there any user-facing changes?

Yes.

Docs: Added guidance in the UDF/UDAF user guide on returning lists from UDAFs.
Behavior: If a user returns a pyarrow.Array/ChunkedArray from evaluate() for a list-typed UDAF, it is now converted into the correct list-valued scalar automatically (rather than failing with a type conversion error).
No breaking API changes are intended; this clarifies and enforces the existing contract that evaluate() must return a scalar.

LLM-generated code disclosure

This PR includes code, comments generated with assistance from LLM. All LLM-generated content has been manually reviewed and tested.

Store UDAF return type in Rust accumulator and wrap pyarrow Array/ChunkedArray returns into list scalars for list-like return types. Add a UDAF test to return a list of timestamps via a pyarrow array, validating the aggregate output for correctness.

Add documented list-valued scalar returns for UDAF accumulators, including an example with pa.scalar and a note about unsupported pyarrow.Array returns from evaluate(). Also, introduce a UDAF FAQ entry detailing list-returning patterns and required return_type/state_type definitions.

…ype checking for list types

…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error

…amp_return

timsaucer · 2026-02-04T21:19:18Z

Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things

telling users that they need to return pyarrow scalars as the return of evaluate
detect when it is a list and then we convert it to a python value and back into a pyarrow scalar

It feels like this isn't the best option. I think we want to avoid doing any kind of to_pylist() calls.

I think a more general solution would be something like

Determine if they have passed in a pyarrow scalar value. If so, use it.
If they have not passed in a pyarrow scalar value, use py_obj_to_scalar_value to convert to a scalar value
Update py_obj_to_scalar_value to detect pyarrow arrays and convert them to ScalarValue::List

For the last part we could do something like

    if obj.hasattr("__arrow_c_array__")? {
        let array_data = ArrayData::from_pyarrow_bound(&obj)?;

        let array = make_array(array_data);

        // ScalarValue::ListArray must be a list of length 1
        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, array.len() as i32]));
        let list_array = Arc::new(ListArray::new(
            Arc::new(Field::new_list_field(array.data_type().clone(), true)),
            offsets,
            array,
            None,
        ));

        return Ok(ScalarValue::List(list_array));
    }

Additionally, if we're going to go down this route I think we would want to treat both the state() and evaluate() since both of them should be returning scalars.

An advantage of the point described above is that I think it adds more flexibility to the users because their python functions can just return python integers and such without having to convert them to pyarrow scalars.

What do you think?

timsaucer · 2026-02-04T23:38:58Z

One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement __arrow_c_array__ on a scalar value, and we wouldn't want to accidentally turn that into a ScalarValue::List.

kosiew added 7 commits January 20, 2026 13:16

Fix pyarrow calls and improve type handling in RustAccumulator

21906bb

Refactor RustAccumulator to support pyarrow array types and improve t…

7f363a7

…ype checking for list types

Fixed PyO3 type mismatch by cloning Array/ChunkedArray types before u…

5271ba2

…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error

Add timezone information to datetime objects in test_udaf_list_timest…

9c59258

…amp_return

clippy fix

6742954

kosiew marked this pull request as ready for review January 27, 2026 03:08

timsaucer mentioned this pull request Feb 3, 2026

Release DataFusion-Python 52.0.0 #1364

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347

Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347
kosiew wants to merge 7 commits intoapache:mainfrom
kosiew:typeconversion-issue-1339

kosiew commented Jan 20, 2026

Uh oh!

timsaucer commented Feb 4, 2026 •

edited

Loading

Uh oh!

timsaucer commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kosiew commented Jan 20, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

LLM-generated code disclosure

Uh oh!

timsaucer commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timsaucer commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timsaucer commented Feb 4, 2026 •

edited

Loading